Decision Trees and Ensemble Methods

DDD: Elements of Statistical Machine Learning & Politics of Data

Ayush Patel

At Azim Premji University, Bhopal

13 Feb, 2026

Hello

I am Ayush.

I am a researcher working at the intersection of data, development and economics.

I am a Posit (formerly RStudio) certified Tidyverse instructor.

I am a Researcher at the Oxford Poverty and Human Development Initiative (OPHI), at the University of Oxford.

Did you come prepared?

  • You have installed R. If not, see this link.

  • You have installed RStudio/Positron/VS Code or any other IDE. It is recommended that you work through an IDE.

  • You have the libraries {tree}, {gbm}, {randomForest} and {BART} installed.
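If any of the packages are missing, a one-time installation from CRAN is all that is needed (a minimal sketch; package names as listed above):

```r
# Install the session's packages from CRAN (run once)
install.packages(c("tree", "gbm", "randomForest", "BART"))
```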

Learning Goals

  1. What are decision trees?
  2. When to use them?
  3. How do these work?
  4. Application and interpretation.
  5. Improving decision trees using ensemble methods.

Decision Trees


Supervised and non-parametric.

Can be used for classification and regression.

The predictor space is cut into segments, and the mean response of the training observations in a segment is used as the predicted response for test observations that fall in it.

Simple and easy to interpret, but not the best for prediction accuracy on its own.

Prediction accuracy can be improved by ensemble methods.

Intuition - How it works

Can you identify regions with a similar salary range?


Model output - Predictor Space

from ISLR

Model output - Tree

from ISLR

Region representation



\(R_1 = \left\{X \mid \text{Years} < 4.5\right\}\)

\(R_2 = \left\{X \mid \text{Years} \ge 4.5,\ \text{Hits} < 117.5\right\}\)

\(R_3 = \left\{X \mid \text{Years} \ge 4.5,\ \text{Hits} \ge 117.5\right\}\)
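The three regions amount to a short prediction rule. A minimal sketch in base R, using the Hitters-style splits above (the function name and return labels are illustrative, not fitted output):

```r
# Assign an observation to one of the three regions defined above
region <- function(years, hits) {
  if (years < 4.5) {
    "R1"                       # Years < 4.5
  } else if (hits < 117.5) {
    "R2"                       # Years >= 4.5, Hits < 117.5
  } else {
    "R3"                       # Years >= 4.5, Hits >= 117.5
  }
}

region(3, 150)   # "R1": Years below 4.5, Hits never consulted
region(6, 100)   # "R2"
region(6, 200)   # "R3"
```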

Tree terminology

from ISLR

How should we segment the predictor space?

Segmenting - Theory

Predictor Space of \(p\) variables needs to be segmented into \(J\) different regions.

In theory, the regions can be of any shape; in practice, high-dimensional rectangles (boxes) are chosen for computational ease and interpretability.

For every observation in region \(R_j\), we make the same prediction: the mean (regression) or mode (classification) of the training observations in \(R_j\).

Minimize: \(\sum_{j=1}^{J}{\sum_{i \in R_j}{(y_i - \hat{y}_{R_j})^2}}\)
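The criterion above is easy to compute for any given partition. A minimal base-R sketch with toy data (not from the slides): each region predicts its training mean, and the RSS sums squared residuals over all regions.

```r
# Toy data: six responses assigned to two regions R_1 and R_2
y  <- c(3, 4, 5, 10, 11, 12)   # responses y_i
rj <- c(1, 1, 1, 2, 2, 2)      # region membership

# Prediction for each observation: the mean response of its region
yhat <- ave(y, rj)

# RSS summed over regions, as in the objective above
rss <- sum((y - yhat)^2)
rss   # 2 + 2 = 4
```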

But

It is not feasible to consider every possible set of cutpoints, over every predictor, in every possible order.

(Recall also that we restrict the regions to high-dimensional rectangles for ease of interpretation.)

So, we use recursive binary splitting, which is

top-down

greedy

top-down



We begin with all observations in a single region, at the top of the tree, and split downward. Hence the name top-down.

Greedy



We make the best split available at the current step, without looking ahead to future splits. A predictor \(X_j\) and a cutpoint \(s\) are chosen so that the resulting split leads to the lowest RSS.

This is carried out recursively: each resulting region is split again, until a stopping criterion (e.g. a minimum number of observations per node) is reached.

Formally

At every step we aim to minimize

\(\sum_{i:x_i \in R_1 (j,s)}{(y_i - \hat{y}_{R_1})^2} + \sum_{i:x_i \in R_2 (j,s)}{(y_i - \hat{y}_{R_2})^2}\)
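One greedy step can be written out directly. A minimal base-R sketch for a single predictor (toy data; the function `best_split` is hand-rolled for illustration — `tree()` does this search internally over all predictors):

```r
# For one predictor x, scan candidate cutpoints s and keep the one
# minimizing RSS(R_1) + RSS(R_2), the objective above
best_split <- function(x, y) {
  xs   <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints as candidates
  rss  <- sapply(cuts, function(s) {
    left  <- y[x <  s]                        # R_1(j, s)
    right <- y[x >= s]                        # R_2(j, s)
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(cutpoint = cuts[which.min(rss)], rss = min(rss))
}

x <- c(1, 2, 3, 10, 11, 12)
y <- c(3, 4, 5, 10, 11, 12)
best_split(x, y)   # cutpoint 6.5 separates the two clusters, RSS = 4
```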

Fitting Regression Trees

library(tree)            # for tree()
library(palmerpenguins)  # for the penguins data

tree(body_mass_g ~ ., data = penguins) -> peng_mass_tree

peng_mass_tree
node), split, n, deviance, yval
      * denotes terminal node

1) root 333 215300000 4207  
  2) species: Adelie,Chinstrap 214  40430000 3715  
    4) sex: female 107   8493000 3419 *
    5) sex: male 107  13240000 4010 *
  3) species: Gentoo 119  29670000 5092  
    6) sex: female 58   4519000 4680 *
    7) sex: male 61   5884000 5485 *

Fitting Regression Trees

summary(peng_mass_tree)

Regression tree:
tree(formula = body_mass_g ~ ., data = penguins)
Variables actually used in tree construction:
[1] "species" "sex"    
Number of terminal nodes:  4 
Residual mean deviance:  97680 = 32140000 / 329 
Distribution of residuals:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-760.30 -219.20   15.16    0.00  220.30  815.20 

Fitting Regression Trees

plot(peng_mass_tree)
text(peng_mass_tree, pretty = 0)

Fitting Regression Trees

library(dplyr)  # for mutate() and relocate()

na.omit(penguins) |>
    mutate(pred_mass = predict(peng_mass_tree)) |>
    relocate(body_mass_g, pred_mass)
# A tibble: 333 × 9
   body_mass_g pred_mass species island    bill_length_mm bill_depth_mm
         <int>     <dbl> <fct>   <fct>              <dbl>         <dbl>
 1        3750     4010. Adelie  Torgersen           39.1          18.7
 2        3800     3419. Adelie  Torgersen           39.5          17.4
 3        3250     3419. Adelie  Torgersen           40.3          18  
 4        3450     3419. Adelie  Torgersen           36.7          19.3
 5        3650     4010. Adelie  Torgersen           39.3          20.6
 6        3625     3419. Adelie  Torgersen           38.9          17.8
 7        4675     4010. Adelie  Torgersen           39.2          19.6
 8        3200     3419. Adelie  Torgersen           41.1          17.6
 9        3800     4010. Adelie  Torgersen           38.6          21.2
10        4400     4010. Adelie  Torgersen           34.6          21.1
# ℹ 323 more rows
# ℹ 3 more variables: flipper_length_mm <int>, sex <fct>, year <int>
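With observed and predicted masses side by side, the training error is one line. A minimal sketch with toy vectors (substitute the `body_mass_g` and `pred_mass` columns from the table above):

```r
# Root mean squared error between observed and predicted body mass
obs  <- c(3750, 3800, 3250)   # observed body_mass_g (toy subset)
pred <- c(4010, 3419, 3419)   # tree predictions for the same rows
rmse <- sqrt(mean((obs - pred)^2))
rmse   # in grams
```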